Abstract. We analyse the potential of Gibbs Random Fields for shape prior modelling. We show that the expressive power of second order GRFs is already sufficient to express simple shapes and spatial relations between them simultaneously. This makes it possible to model and recognise complex shapes as spatial compositions of simpler parts.

1 Introduction

Motivation and goals Recognition of shape characteristics is one of the major aspects of visual information processing. Together with colour, motion and depth processing it forms the main pathways in the visual cortex.

Experiments in cognitive science show quite impressively that humans recognise complex shapes by decomposing them into simpler parts and interpreting the former as coherent spatial compositions of these parts [6]. Corresponding guiding principles for the decomposition were identified from these experiments as well as from research in computer vision (see e.g. [8]). The formulation of these principles relies, however, on the assumption that the objects are already segmented, so that concepts like convexity and curvature can be applied.

From the point of view of computer vision it is desirable to use shape processing and modelling in the early stages of visual processing. This makes it possible to control e.g. segmentation directly by prior assumptions or by feedback from higher processing layers. This leads to the question whether composite shape models can be represented and learned in a topologically fully distributed way. The aim of the presented work is to study this question for probabilistic graphical models.

Related work Mathematically well-principled shape models for early vision can be roughly divided into the following two groups.

Global models treat shapes as a whole. Prominent representatives are variational models and level set methods in particular. A shape is described up to its pose by means of a level set function defined on the image domain. Cremers et al. have shown in [2] how to extend these models for scene segmentation. Recently we have shown how to use level set methods in conjunction with MRFs [3]. Global shape models are well suited e.g. for segmentation and tracking if the number of objects is known in advance and a good initial pose estimate is provided.

Semi-global models consider shape characteristics in local neighbourhoods and go back to the ideas of G. Hinton on "product of experts" as well as of Roth and Black on "fields of experts" (see [5, 10] and citations therein). Mathematically these models are higher order GRFs of a certain type: additional auxiliary variables are used to express mixtures of local shape characteristics in usually overlapping neighbourhoods. Marginalisation over these auxiliary variables results in GRFs of higher order. The work of Kohli, Torr et al. [9, 7] demonstrates how to introduce such higher order Gibbs potentials directly and to use them for segmentation in hierarchical Conditional Random Fields (CRF). However, it is not clear how to learn the graphical structure for such models.



Contributions We will show that Gibbs Random Fields of second order already have sufficient expressive power to model complex shapes as coherent spatial compositions of simpler parts. Obviously, these models need a significantly more complex graphical structure than simple lattices. Moreover, the graphical structure itself becomes a parameter which has to be learnt together with the Gibbs potentials for each considered shape class.

From the application point of view these models have advantages especially in the context of scenes with an unknown number of similar objects (i.e. all objects are instances of a single shape class). Moreover, such models can be easily combined for scenes with instances of different shape classes.

The structure of the paper is as follows. In section 2 we introduce the GRF model for composite shapes and discuss the inference and learning tasks. The latter comprises learning the Gibbs potentials and the graphical structure itself. Section 3 presents experiments exploring the expressive power of the model: first we separately show its ability to express spatial relations between segments and its ability to model simple shapes. Then we demonstrate its capability to model composite shapes, including structure learning. Finally, we show how to combine such models for the discrimination of shape classes.



2 The shape model 

Probability distribution We begin with the description of the prior part of our shape model. Let $D \subset \mathbb{Z}^2$ be a finite set of nodes $t \in D$, where each node corresponds to an image pixel. Let $A \subset \mathbb{Z}^2$ be a set of vectors used to define a neighbourhood structure on the set of nodes, i.e. a graph: two nodes $t$ and $t'$ are connected by an edge if $t' - t = a \in A$. To avoid double edges we require $-A \cap A = \{0\}$ (we use unary potentials as well). The resulting graph is obviously translation invariant, and the elements $a \in A$ define subsets $E_a \subset E$ of equivalent edges, where $e = (t, t') \in E_a$ if $t' - t = a$. A simple example is shown in Fig. 1.
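The construction of the edge classes can be illustrated with a minimal sketch. It assumes that nodes are (row, column) pixel coordinates on a rectangular grid and that difference vectors are given as (dy, dx) pairs; the function names `edges_for_vector` and `build_graph` are illustrative and do not appear in the paper.

```python
def edges_for_vector(height, width, a):
    """Return all edges (t, t') with t' - t = a inside a height x width grid."""
    dy, dx = a
    edges = []
    for y in range(height):
        for x in range(width):
            y2, x2 = y + dy, x + dx
            # Keep the edge only if its end point lies inside the domain D.
            if 0 <= y2 < height and 0 <= x2 < width:
                edges.append(((y, x), (y2, x2)))
    return edges


def build_graph(height, width, A):
    """Map each difference vector a in A to its class of equivalent edges E_a."""
    return {a: edges_for_vector(height, width, a) for a in A}


# Example: an (assumed) structure with three short vectors and one longer one.
A = [(0, 1), (1, 0), (1, 1), (2, -1)]
graph = build_graph(4, 5, A)
```

All edges in one class `graph[a]` are translated copies of each other, which is what allows them to share one table of potentials.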

Fig. 1. Left: example of a translation invariant graphical structure. Equivalence classes of edges $E_a$ are shown in different colours. The set $A$ is represented by bold edges outgoing from the central node. Right: Gibbs potentials for an edge from $E_a$.

Given a class of composite shapes, we denote the set of its parts, enlarged by an extra element for the background, by $K$. A shape-part labelling $y \colon D \to K$ is a mapping that assigns either a shape-part label or the background label $y_t \in K$ to each node $t \in D$. A function $u_a \colon K \times K \to \mathbb{R}$ is defined for each difference vector $a \in A$. Its values $u_a(k, k')$ are called Gibbs potentials. A corresponding probability distribution is defined over the set of shape-part labellings as follows

$$p(y) = \frac{1}{Z} \exp\Big[\sum_{a \in A} \sum_{tt' \in E_a} u_a(y_t, y_{t'})\Big], \tag{1}$$

where $Z$ denotes the partition sum (we omit the unary terms for better readability). This p.d. is homogeneously parametrised: all edges in an equivalence class $E_a$ have the same potentials.
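As an illustration of the homogeneous parametrisation, the following sketch evaluates the exponent of the distribution, i.e. $\log p(y) + \log Z$, for a labelling stored as an integer array. It assumes NumPy, labels coded as $0, \dots, |K|-1$, and potentials given as a dictionary from (dy, dx) vectors to $|K| \times |K|$ arrays; the helper names are hypothetical.

```python
import numpy as np


def shifted_pairs(y, a):
    """Return label arrays (y_t, y_t') for all edges with t' - t = a = (dy, dx)."""
    dy, dx = a
    h, w = y.shape
    if abs(dy) >= h or abs(dx) >= w:
        # The vector does not fit into the domain: no edges of this class.
        empty = np.zeros((0, 0), dtype=y.dtype)
        return empty, empty
    src = y[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    dst = y[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
    return src, dst


def log_unnormalised_prob(y, potentials):
    """Sum u_a(y_t, y_t') over all a in A and all edges in E_a."""
    total = 0.0
    for a, u in potentials.items():
        src, dst = shifted_pairs(y, a)
        total += u[src, dst].sum()  # the same table u is used for every edge in E_a
    return total


potts = np.array([[1.0, -1.0], [-1.0, 1.0]])
potentials = {(0, 1): potts, (1, 0): potts}
y = np.array([[0, 0], [1, 1]])
score = log_unnormalised_prob(y, potentials)
```

The partition sum $Z$ is deliberately left out, since it requires a sum over all labellings.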

Remark 1. Note that the parameters $u_a$ of this model are unique up to additive constants for a given p.d. under fairly general assumptions: the only possible equivalent transformations (aka re-parametrisations) consist in adding a constant, $\tilde{u}_a(\cdot, \cdot) = u_a(\cdot, \cdot) + \mathrm{const}$. This will be shown in appendix A. Therefore, we assume from here on that the Gibbs potentials for each $a \in A$ are normalised to sum to zero: $\sum_{k, k'} u_a(k, k') = 0$.

Remark 2. It is important to notice that a homogeneously parametrised GRF on a finite domain $D \subset \mathbb{Z}^2$ is not necessarily homogeneous. A p.d. $p(y)$ for labellings $y \colon D \to K$ is called homogeneous if its marginals for congruent subsets coincide. The inhomogeneity, if present, usually shows up at the domain boundary. It is easy to verify that the converse is true at least for chains: a homogeneous Markov model on a finite chain admits a homogeneous parametrisation.

The appearance model is assumed to be a "simple" conditionally independent model. The probability of observing an image $x \colon D \to C$ ($C$ is some colour space) given a shape-part labelling $y$ is

$$p(x \mid y) = \prod_{t \in D} p(x_t \mid y_t). \tag{2}$$

In the light of the current popularity of CRFs it might well be asked why we decided to favour a GRF here. Both variants are identical with respect to inference. Differences occur for learning. We can imagine that shape-part labellings can be used as latent variable layers for complex object segmentation models. Recently, empirical risk minimisation learning has been proposed for structured SVM models with latent variables [13]. This shows that learning of graphical models with latent variables is possible for both variants, GRFs and CRFs. However, since we want to study the expressive power of the model in its pure form, we need a prior p.d. Moreover, we want to be able to learn such models fully unsupervised, which is possible for GRFs but not for CRFs.

The inference task Informally, the inference task can be understood as follows. Given an observation (i.e. an image), it is necessary to assign values to all hidden variables. We pose the segmentation task as a Bayesian decision task. Let $y'$ be the true (but unknown) segmentation and $C(y, y')$ be a loss function that assigns a penalty to each possible decision $y$. The task of Bayesian decision is to minimise the expected loss

$$R(y; x) = \sum_{y'} p(y' \mid x)\, C(y, y') \to \min_{y}. \tag{3}$$

We use the number of misclassified pixels

$$C(y, y') = \sum_{t \in D} \mathbb{1}\{y_t \neq y'_t\} \tag{4}$$

as the loss function. It leads to the max-marginal decision

$$y_t^* = \arg\max_{k \in K} p(y_t = k \mid x) \quad \forall t \in D. \tag{5}$$

Hence, it is necessary to calculate the marginal posterior probabilities for each node $t \in D$ and label $k \in K$. Computing them exactly is currently infeasible for GRFs. Several approximation techniques based e.g. on belief propagation or variational methods have been proposed for this task (see e.g. [12] for an overview). Unfortunately none of them guarantees convergence to the exact values of the sought-after marginal probabilities. To our knowledge, the only scheme which does is sampling, which is however known to be slow [11].
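When sampling is used, the max-marginal decision (5) can be approximated by label frequencies. The sketch below assumes that a list of labellings drawn from the posterior is already available (how they are drawn is not shown here) and that labels are coded as $0, \dots, |K|-1$; the function name is illustrative.

```python
import numpy as np


def max_marginal_decision(samples, num_labels):
    """Estimate p(y_t = k | x) by label frequencies and take the argmax per pixel."""
    samples = np.asarray(samples)  # shape (S, H, W): S sampled labellings
    counts = np.stack([(samples == k).sum(axis=0) for k in range(num_labels)])
    marginals = counts / samples.shape[0]  # shape (K, H, W)
    return marginals.argmax(axis=0), marginals


# Three toy "posterior samples" of a 1 x 2 image with two labels.
samples = [[[0, 1]], [[0, 1]], [[1, 1]]]
decision, marginals = max_marginal_decision(samples, num_labels=2)
```

The quality of the decision depends on the number of samples and on how well the sampler has mixed, which is the cost referred to in the text.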

Estimation of Gibbs potentials The learning task consists in estimating the unknown model parameters given a learning sample. We assume that the latter is a random realisation of i.i.d. random variables, so that the Maximum Likelihood estimator is applicable. The following situations are distinguished depending on the format of the learning data. If the elements of the sample have the format $(x, y)$ then the learning is called supervised. If, instead, they consist of images only then the learning is called fully unsupervised. To cope with variants in-between as well, i.e. partial labellings $y_V$, we consider the elements of the training sample to be events of the type $B = (x, y_V) = \{(y, x) \mid y|_V = y_V\}$.

We start with the learning of the unknown potentials $u$. For simplicity we consider the case when only one event $B$ is given as the training sample. According to the Maximum Likelihood principle, the task is

$$p(B; u) = \sum_{y \in B} p(y; u)\, p(x \mid y) \to \max_{u}. \tag{6}$$

Taking the logarithm and substituting the model (1), (2) gives

$$L(u) = \log \sum_{y \in B} \exp\Big[\sum_{a \in A} \sum_{tt' \in E_a} u_a(y_t, y_{t'})\Big]\, p(x \mid y) - \log Z(u) \to \max_{u}. \tag{7}$$

It is easy to prove that the derivative with respect to the potentials is a difference of expectations of some random variable $n_a(k, k'; y)$ with respect to the posterior and prior p.d.

$$\frac{\partial L}{\partial u_a(k, k')} = E_{p(y \mid B; u)}[n_a(k, k'; y)] - E_{p(y; u)}[n_a(k, k'; y)]. \tag{8}$$

The random variables $n_a(k, k'; y)$ are defined by

$$n_a(k, k'; y) = \sum_{tt' \in E_a} \mathbb{1}\{y_t = k,\ y_{t'} = k'\} \tag{9}$$

and represent co-occurrences for label pairs $(k, k')$ along the edges in $E_a$ for a labelling $y$. Combining these random variables into a random vector $\Phi$, the gradient of the log-likelihood can be written as

$$\nabla L(u) = E_{p(y \mid B; u)}[\Phi] - E_{p(y; u)}[\Phi]. \tag{10}$$
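The co-occurrence statistics (9) are simple to compute for a given labelling. The following sketch assumes NumPy, a labelling stored as a 2-D integer array with labels $0, \dots, |K|-1$, and a difference vector given as (dy, dx); the function name `cooccurrences` is an illustrative choice.

```python
import numpy as np


def cooccurrences(y, a, num_labels):
    """n_a(k, k'; y): count label pairs (y_t, y_t') over all edges with t' - t = a."""
    dy, dx = a
    h, w = y.shape
    n = np.zeros((num_labels, num_labels), dtype=np.int64)
    if abs(dy) >= h or abs(dx) >= w:
        return n  # the vector does not fit into the domain, so E_a is empty
    src = y[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    dst = y[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
    # Unbuffered accumulation: repeated (k, k') index pairs are all counted.
    np.add.at(n, (src.ravel(), dst.ravel()), 1)
    return n


y = np.array([[0, 0, 1],
              [1, 1, 1]])
n_right = cooccurrences(y, (0, 1), num_labels=2)
n_down = cooccurrences(y, (1, 0), num_labels=2)
```

Stacking these tables over all $a \in A$ gives one realisation of the random vector $\Phi$.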

The exact calculation of the expectations in (8) is not feasible. Therefore, we propose to use stochastic gradient ascent to maximise (7). The learning algorithm is an iteration of the following steps:

1. Sample $\hat{y}$ and $\tilde{y}$ according to the current a-posteriori probability $p(y \mid B; u)$ and a-priori probability $p(y; u)$ respectively.

2. Compute $n_a(k, k'; \hat{y})$ and $n_a(k, k'; \tilde{y})$ by (9) for each $a \in A$, $k, k' \in K$.

3. Replace the expectations in (8) by their realisations and calculate new potentials $u$.
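One iteration of these three steps can be sketched as follows. This is not the authors' implementation: it assumes a plain Gibbs sampler with one sweep per iteration, a fixed step size, no unary potentials (the zero vector is not in the dictionary), and an appearance term passed as an array `log_lik[k, row, col]` of values $\log p(x_t \mid k)$; fixed labels of a partial labelling are omitted for brevity. All names are hypothetical.

```python
import numpy as np


def cooccurrences(y, a, num_labels):
    """n_a(k, k'; y) from (9) for a = (dy, dx)."""
    dy, dx = a
    h, w = y.shape
    n = np.zeros((num_labels, num_labels))
    if abs(dy) >= h or abs(dx) >= w:
        return n
    src = y[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    dst = y[max(0, dy):h - max(0, -dy), max(0, dx):w - max(0, -dx)]
    np.add.at(n, (src.ravel(), dst.ravel()), 1)
    return n


def local_scores(y, ty, tx, potentials, num_labels):
    """Sum of potentials on all edges incident to node t, per candidate label."""
    h, w = y.shape
    scores = np.zeros(num_labels)
    for (dy, dx), u in potentials.items():
        if 0 <= ty + dy < h and 0 <= tx + dx < w:   # edge (t, t + a)
            scores += u[:, y[ty + dy, tx + dx]]
        if 0 <= ty - dy < h and 0 <= tx - dx < w:   # edge (t - a, t)
            scores += u[y[ty - dy, tx - dx], :]
    return scores


def gibbs_sweep(y, potentials, num_labels, rng, log_lik=None):
    """One in-place Gibbs sweep; with log_lik it samples from the posterior."""
    h, w = y.shape
    for ty in range(h):
        for tx in range(w):
            s = local_scores(y, ty, tx, potentials, num_labels)
            if log_lik is not None:
                s = s + log_lik[:, ty, tx]
            p = np.exp(s - s.max())
            y[ty, tx] = rng.choice(num_labels, p=p / p.sum())
    return y


def stochastic_gradient_step(y_post, y_prior, potentials, num_labels, rng,
                             log_lik, step=0.01):
    """Steps 1 to 3: sample, count co-occurrences, update the potentials."""
    gibbs_sweep(y_post, potentials, num_labels, rng, log_lik)   # a-posteriori
    gibbs_sweep(y_prior, potentials, num_labels, rng)           # a-priori
    updated = {}
    for a, u in potentials.items():
        grad = (cooccurrences(y_post, a, num_labels)
                - cooccurrences(y_prior, a, num_labels))
        u_new = u + step * grad
        updated[a] = u_new - u_new.mean()  # keep potentials normalised to sum zero
    return updated
```

The two labellings `y_post` and `y_prior` are kept between iterations, so that each call continues the two Markov chains instead of restarting them.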

For the sake of completeness we would like to mention that the learning of the appearance models $p(c \mid k)$ can be done in a very similar manner. It is even simpler from the computational point of view because the normalising constant $Z$ does not depend on these probabilities. Therefore it is not necessary to sample labellings according to the a-priori probability distribution $p(y)$. Only a-posteriori sampled labellings are needed to perform the corresponding stochastic gradient step.

Estimation of the interaction structure A very important question not discussed so far is the optimal choice of the neighbourhood structure $A$. Unfortunately, no well founded answer to this question is known at present. One option is to use an abundant set of interaction edges, e.g. to assume that the set $A$ consists of all vectors $A = \{a \in \mathbb{Z}^2 \mid |a_1|, |a_2| \le d\}$ within a certain range. Apart from the computational complexity, this would lead to models with high VC dimension and possibly, as a result, to weak discrimination. It is therefore important to investigate the possibility of identifying the neighbourhood structure $A$ from a given training sample. A possible variant of a corresponding formal task reads as follows. Given a training sample, the task is to find the best neighbourhood structure $A$ of given size $|A| = m$ according to the Maximum Likelihood principle $L(u_A, A) \to \max_{u_A, A}$. This task is however not feasible: an exhaustive search over all possible sets $A$ would be computationally prohibitive and, moreover, the likelihood can be calculated only approximately. Therefore we rely on a greedy approximation which we will consider in two variants: one of them successively includes new elements into the neighbourhood structure starting from $A = \{0\}$, and the other successively removes elements from this structure starting from $A = \{a \in \mathbb{Z}^2 \mid |a_1|, |a_2| \le d\}$.

For the first variant we use a greedy search for the interaction edges proposed by Zalesny and Gimel'farb in the context of texture modelling [14, 4]. Starting from the set $A = \{0\}$, i.e. a model with unary potentials, new edges are iteratively chosen and included into $A$ as follows. First, the optimal set of potentials $u_A^* \in \mathcal{U}_A$ is determined for the current set $A$ as described in the previous subsection. Here $\mathcal{U}_A$ denotes the subspace of potentials on the edges in $A$ (we may assume that the Gibbs potentials are zero on all other edges). If a bigger neighbourhood $A'$ is considered, then clearly, the gradient of the (log) likelihood with respect to $u_{A'}$ at the point $u_A^*$ will be orthogonal to the subspace $\mathcal{U}_A$. The proposal is to include the vector $a' \in A' \setminus A$ with the largest gradient component

$$a' = \arg\max_{a \in A' \setminus A} \sum_{k, k'} \big[n_a(k, k'; B, u) - n_a(k, k'; u)\big]^2, \tag{11}$$

where $n_a(k, k'; B, u)$ and $n_a(k, k'; u)$ denote the a-posteriori and a-priori expectations of the co-occurrences (9), cf. (8). Optionally the Kullback-Leibler divergence can be used instead of the Euclidean distance.
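The selection step of the growth variant can be sketched as follows, assuming that estimates of the a-posteriori and a-priori co-occurrence statistics are already available for every candidate vector (for instance from sampled labellings) and that the squared Euclidean distance is used as the score. The function name and the dictionary layout are assumptions made for illustration.

```python
import numpy as np


def select_new_vector(post_stats, prior_stats, current_A):
    """Return the candidate a' not in A with the largest squared gradient norm."""
    best, best_score = None, -np.inf
    for a in post_stats:
        if a in current_A:
            continue  # potentials on A are already optimal, gradient is ~0 there
        score = float(((post_stats[a] - prior_stats[a]) ** 2).sum())
        if score > best_score:
            best, best_score = a, score
    return best, best_score


post = {(0, 1): np.array([[5.0, 1.0], [1.0, 5.0]]),
        (0, 2): np.array([[3.0, 3.0], [3.0, 3.0]]),
        (1, 0): np.array([[6.0, 0.0], [0.0, 6.0]])}
prior = {(0, 1): np.array([[5.0, 1.0], [1.0, 5.0]]),
         (0, 2): np.array([[2.0, 4.0], [4.0, 2.0]]),
         (1, 0): np.array([[3.0, 3.0], [3.0, 3.0]])}
chosen, score = select_new_vector(post, prior, current_A={(0, 1)})
```

After a vector has been added, the potentials have to be re-estimated for the enlarged structure before the next selection.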

The second variant of structure estimation proceeds in the opposite order. Starting with the neighbourhood structure $A = \{a \in \mathbb{Z}^2 \mid |a_1|, |a_2| \le d\}$, elements of $A$ are successively removed. The aim is to remove in each step the element with the smallest impact on the maximal likelihood

$$\max_{u_A} L(u_A) - \max_{u_{A \setminus a}} L(u_{A \setminus a}) \to \min_{a \in A}. \tag{12}$$

It is impossible to estimate this expression at the point $u_A^* = \arg\max_{u_A} L(u_A)$ using the gradient of the likelihood (as in the first variant) because $\nabla L(u_A^*) = 0$. It is nevertheless possible to estimate this expression based on $u_A^*$. For the sake of simplicity we show this for the situation of supervised learning. The likelihood maximisation with respect to the Gibbs potentials reads

$$\max_{u_A} \Big[\langle \psi_A, u_A \rangle - \log \sum_{y} \exp \langle \phi_A(y), u_A \rangle\Big] \tag{13}$$

for this case. Here we have used the following notations. The set of all Gibbs potentials $u_a(\cdot, \cdot)$, $a \in A$, is considered as a vector $u_A$. A realisation of the random vector $\Phi_A$ (see (10)) is denoted by $\phi_A(y)$. Finally, $\psi_A$ denotes the corresponding vector of statistics resulting from the training sample. Designating $\log Z(u_A)$ by $H(u_A)$, the expression in (13) is nothing but the Fenchel conjugate $H^*(\psi_A)$. It is known that for exponential families the latter can be written as

$$H^*(\psi_A) = \inf\Big\{\sum_{y} p(y) \log p(y) \ \Big|\ E_p[\Phi_A] = \psi_A,\ p \in \mathcal{P}\Big\} \tag{14}$$

(see e.g. [12, 1]), where we denoted the expectation w.r.t. a probability distribution $p$ by $E_p$ and the set of all probability distributions on labellings $y$ by $\mathcal{P}$. This means finding the p.d. with maximal entropy among all distributions having expectation $\psi_A$ of the random vector $\Phi_A$.

Removing an element $a$ from the neighbourhood structure $A$ can be equivalently expressed by the linear constraints $u_a = 0$. Considering the task (13) with these additional constraints, it can be shown by the use of Fenchel duality (see e.g. [1]) that the corresponding conjugate function $\tilde{H}^*(\psi_A)$ can be written as

$$\tilde{H}^*(\psi_A) = \inf_{z_a} \inf\Big\{\sum_{y} p(y) \log p(y) \ \Big|\ E_p[\Phi_A] = \psi_A + z_a,\ p \in \mathcal{P}\Big\}, \tag{15}$$

where $z_a$ is an arbitrary vector of the subspace $\mathcal{U}_a$. Therefore, the difference in (12) is equal to $H^*(\psi_A) - \tilde{H}^*(\psi_A) = H^*(\psi_A) - H^*(\psi_A + z_a^*)$ and can be estimated by the gradient of $H^*$ at $\psi_A$. The latter gradient is nothing but the vector of Gibbs potentials $u_A^*$.



Remarks. The convex, lower semi-continuous function $H^*(\psi_A)$ is not differentiable in general. Therefore its sub-differential may consist of more than one subgradient $u_A$. This corresponds to the non-uniqueness of the Gibbs potentials. We have however shown that the Gibbs potentials are unique up to additive constants for the model class considered in this paper (see Remark 1 and Appendix A).

Summarising, the difference in (12) can be estimated by $\|u_a^*\|$, which leads to the following greedy removal strategy for elements of the neighbourhood structure $A$. Given a current neighbourhood structure $A$, estimate the optimal Gibbs potentials $u_A^*$ and remove the element $a \in A$ with the smallest value of $\|u_a^*\|$.
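The removal strategy can be sketched as a short loop. The sketch assumes a user-supplied routine `fit(A)` that returns the estimated optimal potentials for a given structure as a dictionary of $|K| \times |K|$ arrays, normalised to sum to zero; `fit`, `greedy_shrink` and the stopping rule by target size are hypothetical and not prescribed by the paper.

```python
import numpy as np


def greedy_shrink(A, fit, target_size):
    """Repeatedly re-estimate the potentials and drop the vector with smallest ||u_a||."""
    A = list(A)
    while len(A) > target_size:
        potentials = fit(A)  # hypothetical learner: {a: K x K array}
        weakest = min(A, key=lambda a: np.linalg.norm(potentials[a]))
        A.remove(weakest)
    return A


# Toy stand-in for the learner: fixed interaction strengths per vector.
strength = {(0, 1): 1.0, (1, 0): 0.9, (0, 5): 0.1, (5, 0): 0.05}
pattern = np.array([[1.0, -1.0], [-1.0, 1.0]])


def toy_fit(A):
    return {a: strength[a] * pattern for a in A}


remaining = greedy_shrink(strength.keys(), toy_fit, target_size=2)
```

In a real run `fit` would be the stochastic gradient learning of the potentials, so each removal is followed by a full re-estimation.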

3 Experiments 

Modelling spatial relations between segments The first experiment investigates the ability of the model (1), (2) to reflect spatial relations between segments, i.e. scene parts which are too large for their shape to be captured by a neighbourhood structure of reasonable size. We used the three images shown in the first row of Fig. 2 as training examples. Each scene should be segmented into three segments: $K = \{\text{sky}, \text{trees}, \text{grass}\}$. The appearance models $p(c \mid k)$ for the segments were assumed to be mixtures of multivariate Gaussians (four per segment). A model with "full" neighbourhood structure, i.e. all vectors $\{a \in \mathbb{Z}^2 \mid |a_1|, |a_2| \le d\}$ with $d = 20$, was used in this experiment. A "simple" but anisotropic Potts model on the 8-neighbourhood was chosen as a baseline for comparison.

Semi-supervised learning was applied by fixing the segment labels in the rectangular areas shown by red rectangles during learning. Both the a-priori models (the potentials and the direction specific Potts parameters for the baseline model) and the appearance models (mixture weights, mean values and covariance matrices) were learned.

The difference between the models can be clearly seen by observing labellings generated a-priori by the learned models, i.e. without input images. Some of them are shown in the second and third row of Fig. 2 for the model with complex neighbourhood structure and the baseline model respectively. It can be seen that the spatial relations between segments (e.g. "above", "below" etc.) were correctly captured by the complex model, whereas this is clearly not the case for the Potts model.

Fig. 2. Modelling spatial relations between segments. The first row shows input images and regions with fixed segmentation. The middle and bottom row show labellings generated by the learned a-priori models (segment labels are coded by colour): the images in the middle row were generated by the model with full neighbourhood, whereas the images in the bottom row were generated by the baseline model.

The consequences can be clearly seen from the following experiment. We fixed the prior models obtained in the previous experiment (semi-supervised learning) for both variants (the complex prior and the Potts prior) and learned the parameters of the Gaussian mixtures completely unsupervised. Fig. 3 shows labellings (i.e. segmentations) sampled at the end of the learning process from the corresponding a-posteriori probability distributions (obtained with the learned appearances) for the complex a-priori model and the Potts a-priori model in the first and the second row respectively. The advantages of the complex model are clearly seen. These results can be explained as follows. There are twelve Gaussians in total to interpret the given images. For the learning process it is "hard to decide" which of the Gaussians belongs to which segment. Using the compactness assumption only is obviously not enough to separate segments from each other. If the complex model is used instead, the learning process starts to generate labellings according to the a-priori probability distribution, i.e. labellings which reflect the correct spatial relations between the segments. This forces the unsupervised learning of the appearance models into the right direction.

Fig. 3. Segmentation results obtained after fully unsupervised learning of the appearance part of the model. Upper row: model with full neighbourhood, bottom row: baseline model.



Modelling simple shapes This group of experiments demonstrates the ability of the model to represent simple shapes as well as to perform shape driven segmentation. This experiment is prototypical e.g. for a class of image recognition tasks in biomedical research. Fig. 4 (upper left) shows a microscope image of liver cells with stained DNA. Thus, only the cell nuclei are visible. The task is to segment the image into two segments: "cells" (which have nearly circular shape) and "background" (the rest including artefacts). Hence, two labels are used. The "full" neighbourhood structure with $d = 12$ was used (it approximately corresponds to the mean cell diameter). Again, we used a baseline model for comparison: a GRF with 4-neighbourhood and free potentials. The appearances for grey-values were assumed to be Gaussian mixtures (two per segment) in both models.

First, semi-supervised learning was performed (as in the previous experiment with trees) in order to learn the prior distributions for labellings as well as the appearances for both the complex and the baseline model. A labelling generated a-priori by the learned complex model is shown in Fig. 4 (upper right). The final segmentations according to the max-marginal decision (see equation (5)) are shown in the bottom row of the same figure. The differences are clearly seen. The shape prior captured in the complex model led to the correct segmentation, in which the artefacts were segmented as background, whereas the baseline model produces a wrong segmentation because neither the appearance nor a simple "compactness" assumption nor even their combination makes it possible to differentiate between cells and artefacts.

Fig. 4. Modelling and segmentation of simple shapes. Upper left: input image, upper right: a labelling generated a-priori by the learned complex model. Final segmentations are shown in the bottom row: left: baseline model, right: complex model.

Structure estimation for simple shapes In order to investigate the structure identifiability of shape models we have used an artificial model which generates simple "blobs". The neighbourhood structure consists of 8 elements. The group of the first four elements with coordinates $(0, 1)$, $(0, -1)$, $(1, 1)$ and $(-1, 1)$ describes a standard 8-neighbourhood. The remaining four vectors are scaled versions of the first (scale factor 5). The Gibbs potentials on the short vectors are supermodular and express the correlation of the labels on the edges of this type

$$u(k, k') = \begin{cases} \alpha & \text{if } k = k', \\ -\alpha & \text{else.} \end{cases} \tag{16}$$

The Gibbs potentials on the long edges consist of a submodular and a modular part, $u(k, k') = u_1(k, k') + u_2(k, k')$, where the submodular part $u_1$ is just the negative version of the potentials on the short edges and expresses an anti-correlation of the labels on these edges. The modular part




(17) 
Fig. 5. Shape estimation for a simple shape model. Left: labelling generated by the known model, right: histogram of the estimated structures.



is used to influence the density of the blobs. A labelling (fragment) sampled by this model ($\alpha = 0.35$, $\beta = 0.5$) is shown in Fig. 5. Both heuristic approaches for structure estimation discussed in the previous section were applied for the supervised version, i.e. using a labelling generated by the known model as a learning sample.

The first approach, iterative growth of the structure, was run 40 times. The estimated structures resulting from these runs are shown in Fig. 5 as a grey-coded histogram. As stochastic gradient ascent is used for the learning of the potentials, each run may result in a different structure. The histogram shows however that the structure estimation is essentially correct. All trials of the second approach, iterative shrinking of the neighbourhood structure, resulted, much to our surprise, in one and the same estimated structure: the correct one.

We conclude from these experiments that the neighbourhood structure of a shape 
model is identifiable (at least in principle) from labellings generated by the model. 



Modelling composite shapes The previous experiments have shown that second order GRFs can model both spatial relations between segments and simple shapes. Now we are going to demonstrate the capability of the model to capture both properties simultaneously. This opens the possibility of representing complex shapes as spatial compositions of simpler parts. To demonstrate this, we use an artificial example shown in Fig. 6 (upper left). It was produced manually and corrupted by Gaussian noise. Accordingly, the model was defined as follows. The label set $K$ consists of seven labels, each one corresponding to a part of the modelled shape (as well as one for the background). The appearance models $p(c \mid k)$ for the labels are Gaussians with known parameters. In this experiment we applied the growth variant for the estimation of the interaction structure as described in section 2.

Fig. 6 (upper row, center) shows a labelling generated by the learned prior model. It is clearly seen that both spatial relations between object parts and part shapes are captured correctly.

The bottom row of Fig. 6 displays labellings generated during the process of structure learning at moments when the interaction structure learned so far was not yet capable of capturing all needed properties. As can be seen, the model was able to learn spatial relations between the segments more or less correctly even for a small number of edges (5 edges, bottom left). More relations are learned as the number of edges grows (bottom middle and right). Finally, 20 difference vectors were necessary to capture all relations (out of 1200 possible for the maximal range of $d = 24$).

Fig. 6. Composite shape modelling. Upper row from left: input image, labelling generated a-priori by the learned model, estimated interaction structure. Bottom row: labellings generated by models during learning.

Fig. 6 (upper right) shows the estimated neighbourhood structure. The endpoints of all edges from the central pixel are marked by colours (the image is magnified for better visibility). A certain structure can be seen in this image. The 8-neighbourhood edges (black) reflect compactness and adjacency relations of the object parts. The learned potentials on these edges represent strong label co-occurrences. Most of the other vectors are responsible for the shapes of the parts. The potentials on the red edges express characteristic breadths, and the potentials on the green edges express characteristic lengths of the parts. The potentials on these edges mainly represent anti-correlations, forcing label values to change along certain directions. The blue pixels in the figure reflect relative positions of object parts.



Composite shape recognition The final experiment demonstrates possibilities to combine composite shape models. The aim is to obtain a joint model which can be used for detection, segmentation and classification of objects in scenes populated by instances of different shape classes, like e.g. the example in Fig. 7. We conclude from the previous experiments that the appearance model can be re-learned in a fully unsupervised way if the prior shape model is discriminative. Hence, the most important question is how to combine the prior models. We propose a method for this that is based on the following observation. It is not necessary to have an example image (or an example segmentation) in order to learn the model if the aposteriori statistics

$$\varphi_a(k, k') = E_{p(y \mid B; u)}[\Phi_a(k, k')] \tag{18}$$

for all difference vectors $a \in A$ and label pairs $(k, k')$ are known. The gradient of the likelihood (equation (8)) then reads

$$\frac{\partial L}{\partial u_a(k, k')} = \varphi_a(k, k') - E_{p(y; u)}[n_a(k, k'; y)]. \tag{19}$$

Fig. 7. Shape segmentation and classification. Left: input image, right: segmentation (part-labels are encoded by colours).

Let us consider this in a bit more detail for a simple example with just two shapes, as in Fig. 7. Let us assume that both models are learned, i.e. both the potentials and the statistics are known for both models and for all difference vectors $a$. Obviously, it is not easy to combine the potentials of both shape models in order to obtain new ones for a model that generates such collages. It is however very easy to estimate the needed aposteriori statistics for the joint model given the aposteriori statistics for both shape models. Summarising, the scheme to obtain the parameters of the joint model consists of two stages:

1. compute the aposteriori statistics for the joint model and

2. learn the model according to (19) so that it reproduces these statistics.

As the second stage is standard, we consider the first one in more detail. Let us denote the label sets corresponding to the shape parts by $K^1$ and $K^2$ for the first and for the second shape type respectively. Let $b^1$ and $b^2$ be the background labels in the corresponding shape models and $b$ be the background label in the joint one. Consequently, the label set of the latter is $K^1 \cup K^2 \cup \{b\}$ (see the middle part of Fig. 8).

First of all we enlarge the label sets of each shape model by labels that are not 
present in this model but present in the joint one. Thereby the statistics for the new 
introduced labels (for all difference vectors a) are set to zero (see Fig. 8, left and right). 
Informally said, these extended aposteriori statistics correspond to the situations that 
the joint model is learned on examples, in which only labels of one particular shape 
are present. The aposteriori statistics for the joint model is then obtained as a weighted 
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Fig. 8. Estimation of the aposteriori statistics for tlie joint model. Left and right: extended statis- 
tics for shape models. Middle: the joint model - statistics marked green and red are inherited 
from the components. Others are set to a small constant. 



mixture of the two extended ones and an additional uniformly distributed component. 
The latter is added in order to avoid zero probabilities (which would lead to obvious 
technical problems for the Gibbs Sampler). Summarising, the aposteriori statistics of 
label pairs for a difference vector a of the joint model is: 



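$$
p_a(k,k') =
\begin{cases}
w_1\, p^1_a(k,k') + w_0 & \text{if } k \in K^1 \text{ and } k' \in K^1, \text{ or } k \in K^1 \text{ and } k' = b, \text{ or } k = b \text{ and } k' \in K^1,\\
w_2\, p^2_a(k,k') + w_0 & \text{if } k \in K^2 \text{ and } k' \in K^2, \text{ or } k \in K^2 \text{ and } k' = b, \text{ or } k = b \text{ and } k' \in K^2,\\
w_1\, p^1_a(b,b) + w_2\, p^2_a(b,b) + w_0 & \text{if } k = b \text{ and } k' = b,\\
w_0 & \text{otherwise,}
\end{cases}
\tag{20}
$$

where $p^1_a$ and $p^2_a$ denote the extended statistics of the two shape models, with some
weights $w_0 \ll w_1 \approx w_2$, where the indices 1 and 2 correspond to the
particular shape model. Given these statistics the joint model is learned according to
(19).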
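
The mixing step described above can be illustrated with a short Python sketch. It is not the authors' implementation: the function name `joint_statistics`, the dictionary representation of the statistics, and the label strings `"b1"`, `"b2"`, `"b"` are assumptions made for illustration. The sketch handles one fixed difference vector $a$, identifies each model's background label with the shared background label, treats missing label pairs as zero (the extension step), and adds the uniform component. The result is not renormalised here.

```python
from itertools import product


def joint_statistics(p1, p2, labels1, labels2, w1, w2, w0,
                     b1="b1", b2="b2", b="b"):
    """Mix two single-shape pairwise statistics for one difference vector.

    p1, p2   : dicts mapping a label pair (k, k') to its statistic
    labels1/2: part labels of the two shape models (backgrounds excluded)
    w1, w2   : mixture weights of the two models; w0: uniform component
    All names are hypothetical; they do not come from the paper.
    """
    def extend(p, own_background):
        # Rename the model's own background to the shared background label.
        # Pairs absent from the dict count as zero (extension step).
        def rename(k):
            return b if k == own_background else k
        return {(rename(k), rename(kp)): v for (k, kp), v in p.items()}

    e1, e2 = extend(p1, b1), extend(p2, b2)
    joint_labels = list(labels1) + list(labels2) + [b]
    return {
        (k, kp): w1 * e1.get((k, kp), 0.0) + w2 * e2.get((k, kp), 0.0) + w0
        for k, kp in product(joint_labels, repeat=2)
    }


if __name__ == "__main__":
    p1 = {("x", "x"): 0.5, ("x", "b1"): 0.1, ("b1", "x"): 0.1, ("b1", "b1"): 0.3}
    p2 = {("y", "y"): 0.4, ("y", "b2"): 0.2, ("b2", "y"): 0.2, ("b2", "b2"): 0.2}
    joint = joint_statistics(p1, p2, ["x"], ["y"], w1=0.45, w2=0.45, w0=0.01)
    for pair, value in sorted(joint.items()):
        print(pair, round(value, 4))
```

Pairs that mix parts of different shape types, such as `("x", "y")`, receive only the uniform component `w0`, which keeps them possible but unlikely for the sampler.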

For the experiment in Fig. 7 two composite shape models were learned separately.
The test image in Fig. 7 (left) is a collage of both shape types. Note that the appearance
of all shape parts is identical, so they are not distinguishable without the prior shape
model. Fig. 7 (right) shows the final segmentation. All objects were correctly
segmented and recognised: although both composite shape classes share some
similarly shaped parts, they were not confused.



4 Conclusions 

The notion of shape is often understood as an object property of a global nature. We
followed a different direction by modelling shapes in a distributed way. We have
demonstrated that the expressive power of second order GRFs suffices to model spatial
relations of segments, simple shapes and, moreover, both aspects simultaneously, i.e.
composite shapes, which are understood as coherent spatial compositions of simpler shape parts.

We have shown that complex shapes can be recognised even when
their parts are not distinguishable by appearance. However, in our learning experiments
we used training images in which they are distinguishable. Thus, an important question is
whether it is possible to perform unsupervised decomposition of complex shapes into
simpler parts during the learning phase, i.e. to learn shape models from images in which
the desired spatial relations between shape parts are not explicitly present. Another
important issue is the learning of the interaction structure. It would be very useful to
have a well-grounded approach for this.

A Equivalent transforms for homogeneously parametrised GRFs

As we have already seen, the probability distribution (1) for shape part labellings $y$ can
be equivalently written as

$$p(y) \propto \exp\Bigl[\sum_{k} u_0(k)\, n_0(k;y) + \sum_{a \in A'} \sum_{k,k'} u_a(k,k')\, n_a(k,k';y)\Bigr], \tag{21}$$



where $A' = A \setminus \{0\}$. We call two parametrisations $u$, $\tilde{u}$ equivalent if the
corresponding probability distributions are identical. It follows that the difference
$v = u - \tilde{u}$ of equivalent potentials fulfils

$$V(y) = \sum_{k} v_0(k)\, n_0(k;y) + \sum_{a \in A'} \sum_{k,k'} v_a(k,k')\, n_a(k,k';y) = \text{const}. \tag{22}$$

We will conclude that all functions $v_a$ are constant under fairly general conditions. We
perform the proof in two steps. First we show that the pairwise functions $v_a$, $a \in A'$,
are modular and can be written as a sum of unary functions. In a second step we
conclude the claimed statement under fairly general conditions on the graph $(D,E)$.

Let us consider an arbitrary non-zero vector $a \in A$ of the neighbourhood structure
and an arbitrary edge $(t,t') \in E_a$. Let $k_1, k_2$ be two arbitrary labels in the node $t$ and
$k'_1, k'_2$ be two arbitrary labels in the node $t'$. Let $y_{11}, y_{12}, y_{21}, y_{22}$ be four labellings
with respective values $(k_1,k'_1)$, $(k_1,k'_2)$, $(k_2,k'_1)$, $(k_2,k'_2)$ on the nodes $t, t'$ such that
they coincide on all other vertices. We consider the equation

$$V(y_{11}) + V(y_{22}) - V(y_{12}) - V(y_{21}) = 0. \tag{23}$$

It is easy to see that this equation reduces to

$$v_a(k_1,k'_1) + v_a(k_2,k'_2) - v_a(k_1,k'_2) - v_a(k_2,k'_1) = 0. \tag{24}$$

This holds for arbitrary four-tuples of labels, and it follows that the function $v_a$ is
modular and can be written as a sum of two unary functions

$$v_a(k,k') = \tilde{v}_a(k) + \tilde{v}_{-a}(k'). \tag{25}$$
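
The step from the modularity condition (24) to a sum of two unary functions can be made concrete with a small Python sketch. The function names `is_modular` and `split_modular` are hypothetical, and the particular split shown (anchoring at the first label of each node) is only one of many valid choices, since a constant can be shifted freely between the two unary functions.

```python
def is_modular(v, tol=1e-9):
    """Check condition (24) for every four-tuple of labels.

    v is a list of rows, v[i][j] standing for the pairwise value of the
    label pair (k_i, k'_j).
    """
    n, m = len(v), len(v[0])
    return all(
        abs(v[i][j] + v[k][l] - v[i][l] - v[k][j]) <= tol
        for i in range(n) for k in range(n)
        for j in range(m) for l in range(m)
    )


def split_modular(v):
    """Return unary lists f, g with v[i][j] == f[i] + g[j].

    Assumes v is modular. The first label of each node is the reference:
    f(k) = v(k, k'_0) and g(k') = v(k_0, k') - v(k_0, k'_0).
    """
    f = [row[0] for row in v]
    g = [x - v[0][0] for x in v[0]]
    return f, g


if __name__ == "__main__":
    unary_a = [1.0, -2.0, 0.5]
    unary_b = [0.0, 3.0, 4.0]
    v = [[x + y for y in unary_b] for x in unary_a]
    print(is_modular(v))
    f, g = split_modular(v)
    print(all(abs(v[i][j] - (f[i] + g[j])) < 1e-9
              for i in range(3) for j in range(3)))
```

Applying (24) with the reference labels as the second pair gives $v(k,k') = v(k,k'_0) + v(k_0,k') - v(k_0,k'_0)$, which is exactly what `split_modular` returns.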

These arguments can be applied for every element $a \in A'$. Consequently, $V(y)$ can be
written as

$$V(y) = \sum_{k} v_0(k)\, n_0(k;y) + \sum_{a \in A'} \sum_{k} \bigl[ v_a(k)\, n_a(k;y) + v_{-a}(k)\, n_{-a}(k;y) \bigr], \tag{26}$$

where we have omitted the tildes. Note that $n_a(k;y) = \sum_{k'} n_a(k,k';y)$ denotes the
number of vertices with an outgoing edge of type $a$ for which the labelling $y$ has the
value $k$. Therefore in general $n_0(k;y) \neq n_a(k;y)$.

Let us consider an arbitrary vertex $t$ and two labellings $y$, $\tilde{y}$ which coincide on all
vertices but $t$. It follows from $V(y) - V(\tilde{y}) = 0$ that

$$v_0(k) + \sum_{a \in A' :\, t+a \in D} v_a(k) + \sum_{a \in A' :\, t-a \in D} v_{-a}(k) = \text{const}. \tag{27}$$



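We assign a vector $z(t)$ of dimension $2|A| - 1$ to every vertex $t \in D$ with components

$$z_0(t) = 1, \quad z_a(t) = \begin{cases} 1 & \text{if } t + a \in D,\\ 0 & \text{else,} \end{cases} \quad \text{and} \quad z_{-a}(t) = \begin{cases} 1 & \text{if } t - a \in D,\\ 0 & \text{else.} \end{cases} \tag{28}$$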

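If the domain $D$ contains a subset of nodes $t$ such that their vectors $z(t)$ span the whole
vector space of dimension $2|A| - 1$, then we consider equation (27) for each of
them. Equation (27) states that the scalar product of $z(t)$ with the vector of values
$\bigl(v_0(k), v_a(k), v_{-a}(k)\bigr)$ does not depend on $k$, and hence we obtain

$$v_0(k) = \text{const}, \tag{29}$$

$$v_a(k) = \text{const}, \tag{30}$$

$$v_{-a}(k) = \text{const} \tag{31}$$

for all $a \in A'$. □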
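
Whether a given domain satisfies the spanning condition can be checked numerically. The following Python sketch does this for a rectangular pixel grid. It rests on several assumptions that are not stated in the text: the domain is a `height` x `width` grid, `offsets` lists one representative of each non-zero difference vector (so that $|A|$ equals `len(offsets) + 1`), and the function names are hypothetical. The rank is computed exactly with rational arithmetic.

```python
from fractions import Fraction


def z_vector(t, offsets, height, width):
    """Build z(t) as in (28): a constant 1, then z_a(t) and z_{-a}(t) per offset."""
    def inside(p):
        return 0 <= p[0] < height and 0 <= p[1] < width

    z = [1]
    for a in offsets:
        z.append(int(inside((t[0] + a[0], t[1] + a[1]))))  # z_a(t)
        z.append(int(inside((t[0] - a[0], t[1] - a[1]))))  # z_{-a}(t)
    return z


def rank(rows):
    """Exact matrix rank by Gaussian elimination over the rationals."""
    m = [[Fraction(x) for x in r] for r in rows]
    r = 0
    for c in range(len(m[0]) if m else 0):
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                factor = m[i][c] / m[r][c]
                m[i] = [x - factor * y for x, y in zip(m[i], m[r])]
        r += 1
    return r


def spans_full_space(offsets, height, width):
    """True if the vectors z(t), t in D, span the space of dimension 2|A| - 1."""
    rows = [z_vector((i, j), offsets, height, width)
            for i in range(height) for j in range(width)]
    return rank(rows) == 2 * len(offsets) + 1


if __name__ == "__main__":
    four_neighbourhood = [(0, 1), (1, 0)]
    print(spans_full_space(four_neighbourhood, 3, 3))
    print(spans_full_space(four_neighbourhood, 2, 2))
```

In this sketch a 3 x 3 grid with horizontal and vertical neighbours already satisfies the condition, whereas a 2 x 2 grid does not, because it has no vertex with both an outgoing and an incoming edge of the same type.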



